Fun and efficient data wrangling in R:

An introduction to the tidyverse

Adrienne Marshall

April 5, 2018

Wrangling data

What do we mean by data wrangling?

(image: wrangling)

There are lots of possible definitions, but data wrangling generally means all the transformations you might need to do to get data into a form that is ready for further analysis.

Common tasks

  • Create a new column based on calculations in your data

  • Select only certain parts of your data

  • Do calculations on subsets of your data

  • Modify text data

  • Iterate over your data

  • Merge data from multiple sources

Common tasks

  • Create a new column based on calculations in your data: mutate()

  • Select only certain columns or rows of your data: filter(), select()

  • Do calculations on subsets of your data: group_by(), summarise()

  • Modify text data: stringr functions

  • Iterate over your data: for loops, apply(), purrr::map()

  • Merge data from multiple sources: the *_join() functions (left_join(), inner_join(), …)
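A minimal sketch (using the built-in mtcars data rather than the workshop data) of how the first few of these verbs chain together:

```r
library(dplyr)

mtcars %>%
  mutate(kml = mpg * 0.425) %>%   # new column from a calculation
  filter(cyl == 4) %>%            # keep only certain rows
  select(mpg, kml, wt) %>%        # keep only certain columns
  head()
```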

Before we start…

How would you describe your prior experience with R?

  1. I just downloaded it for the first time.

  2. I use it, but it makes me uncomfortable.

  3. I’m comfortable with it, here to learn about the tidyverse.

How comfortable are you with programming concepts like for loops and writing functions?

  1. No idea what you’re talking about.

  2. I’ve heard of and used these things, but they’re a stretch.

  3. Easy peasy.

Let’s jump in!

  • Disclaimer: I don’t know everything, and there’s always more to learn!

The tidyverse

A collection of R packages designed for data wrangling and visualization, built for tidying your data and working with tidy data.

library(tidyverse)

Why the tidyverse?

Why not the tidyverse?

  • More likely to change than base R

Tidy data

  1. Each variable forms a column.

  2. Each observation forms a row.

  3. Each type of observational unit forms a table. (https://cran.r-project.org/web/packages/tidyr/vignettes/tidy-data.html)

Tidy data example

An untidy example:

names favorite_food favorite_color
Adrienne burritos yellow
Alex pizza turquoise

Tidy data example

Tidied:

names variable value
Adrienne favorite_food burritos
Alex favorite_food pizza
Adrienne favorite_color yellow
Alex favorite_color turquoise
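One way to produce the tidied form is tidyr's gather(). A sketch, assuming the untidy table above lives in a data frame called df:

```r
library(tidyr)

# Hypothetical data frame matching the untidy example above.
df <- data.frame(names = c("Adrienne", "Alex"),
                 favorite_food = c("burritos", "pizza"),
                 favorite_color = c("yellow", "turquoise"),
                 stringsAsFactors = FALSE)

# Gather every column except `names` into variable/value pairs.
df_tidy <- gather(df, key = "variable", value = "value", -names)
```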

Is tidy always better?

Sometimes messy can be good!

When not to tidy…

  • Data entry

  • Data with matrix operations (e.g., statistical modeling)

  • Nice to know your options

Example data

Read data using read_csv().

ufo <- read_csv("../data/ufo_dat.csv")
## Parsed with column specification:
## cols(
##   datetime = col_character(),
##   city = col_character(),
##   state = col_character(),
##   country = col_character(),
##   shape = col_character(),
##   `duration (seconds)` = col_integer(),
##   `duration (hours/min)` = col_character(),
##   comments = col_character(),
##   `date posted` = col_character(),
##   latitude = col_double(),
##   longitude = col_double()
## )
## Warning: 266 parsing failures.
##   row col                expected               actual file
##  1606 duration (seconds) no trailing characters .5     '../data/ufo_dat…
##  1653 duration (seconds) no trailing characters .5     '../data/ufo_dat…
##  1660 duration (seconds) no trailing characters .1     '../data/ufo_dat…
##  1683 duration (seconds) no trailing characters .5     '../data/ufo_dat…
##  2039 duration (seconds) no trailing characters .05    '../data/ufo_dat…
## ...
## See problems(...) for more details.
ufo1 <- read.csv("../data/ufo_dat.csv")

A word on file structures

  • Many structures work, but you should definitely have one.

  • Consider a “recipe” script.

  • Try to avoid:
    • setwd()
    • read.csv("adrienne/cool_project1/version6/sdfljk.csv")

A word on file structures

(image: example project file structure)

Let’s load some data…

read_csv() vs read.csv()

  • speed difference:
system.time(ufo <- read_csv("../data/ufo_dat.csv"))
##    user  system elapsed 
##   0.525   0.023   0.553
system.time(ufo1 <- read.csv("../data/ufo_dat.csv"))
##    user  system elapsed 
##   2.722   0.038   2.768
  • read_csv() is “lazy and surly” (Wickham).

  • read.csv(): stringsAsFactors = TRUE, uses rownames

The pipe: %>%

  • Read it in your head as “then”
    • as in, select part of your data, then arrange it in order.
  • Use with functions that take a data frame as their first argument.
    • Works with: head()
    • Not with: mean()
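A quick sketch of the difference, using the built-in mtcars data:

```r
library(dplyr)

# head() takes a data frame as its first argument, so it pipes naturally:
mtcars %>% head(2)

# mean() expects a vector, not a data frame, so pipe a column instead:
mtcars$mpg %>% mean()  # 20.09062
```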

Common data wrangling tasks

Rename columns

ufo <- ufo %>% 
  rename("duration_sec" = `duration (seconds)`)

This is the same as:

ufo <- rename(ufo, "duration_sec" = `duration (seconds)`)

Create a new column:

mutate()

ufo <- ufo %>%
  mutate(duration_min = duration_sec/60) %>%
  mutate(duration_hr = duration_min/60)

kable(head(ufo[, c("duration_sec", "duration_min", "duration_hr")]))
duration_sec duration_min duration_hr
2700 45.0000000 0.7500000
7200 120.0000000 2.0000000
20 0.3333333 0.0055556
20 0.3333333 0.0055556
900 15.0000000 0.2500000
300 5.0000000 0.0833333

Where are the quotes?!

ufo <- ufo %>%
  mutate(duration_min = duration_sec/60) %>%
  mutate(duration_hr = duration_min/60)

mutate()

Challenge: Add a column that gives the duration in years. (Hint: there are 8760 hours in a year.)

Challenge answer

ufo <- ufo %>% 
  mutate(duration_years = duration_hr/8760)

Subset your data:

filter(), select()

ufo %>% 
  filter(duration_hr > 10) %>%
  select(datetime, duration_sec, duration_hr, `duration (hours/min)`) %>%
  arrange(desc(duration_hr)) %>%
  head()
## # A tibble: 6 x 4
##   datetime        duration_sec duration_hr `duration (hours/min)`
##   <chr>                  <int>       <dbl> <chr>                 
## 1 10/1/1983 17:00     97836000      27177. 31 years              
## 2 6/3/2010 23:30      82800000      23000. 23000hrs              
## 3 9/15/1991 18:00     66276000      18410. 21 years              
## 4 4/2/1983 24:00      52623200      14618. 2 months              
## 5 8/10/2012 21:00     52623200      14618. 2 months              
## 6 8/24/2002 01:00     52623200      14618. 2 months

31 years?!

comment <- ufo %>% 
  filter(duration_sec > 90000000) %>% 
  select(comments)
print(comment$comments)
## [1] "Firstly&#44 I was stunned and stared at the object for what seemed minutes&#44 but probably was only seconds. My first inclination was to bec"

Challenge: Get only the observations shorter than a duration of your choice. Save the results to a new data frame called short_sight.

Challenge answer

short_sight <- ufo %>% filter(duration_hr < 1)

Do calculations on subsets:

group_by(), summarise()

For example, how many sightings were there per state?

n_per_state <- ufo %>%
  group_by(state) %>% 
  count(sort = T)

kable(head(n_per_state))
state n
ca 9655
NA 5797
wa 4268
fl 4200
tx 3677
ny 3219
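count() is shorthand for group_by() followed by summarise(n = n()); the same pattern computes any per-group statistic. A sketch on built-in data:

```r
library(dplyr)

# Mean mpg for each cylinder count in mtcars.
mtcars %>%
  group_by(cyl) %>%
  summarise(mean_mpg = mean(mpg))
```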

Iterate over your data:

for loops, apply(), purrr::map()

When to iterate:

  • Anytime you need to do something more than twice (ish)

  • Examples:
    • downloading many files from the internet
    • analyzing several scenarios
    • calculations for each year in a time series

for loops

  • Advantages: many other programming languages also have for loops.

  • Disadvantages: can be bulky and/or slow compared to other options.

Example for loops

How many UFOs were observed in each state?

states <- unique(ufo$state)
answer_df <- data.frame(state = states, n = NA)

for(i in seq_along(states)){  # seq_along() is safer than 1:length() when the vector is empty
  temporary_data <- ufo %>% filter(state == states[i])
  answer_df$n[i] <- nrow(temporary_data)
}
kable(head(answer_df))
state n
tx 3677
NA 0
hi 353
tn 1193
ct 968
al 691

Example apply function

How many UFOs were observed in each state?

states <- unique(ufo$state)

count_obs <- function(state_id){
  nrow(ufo %>% filter(state == state_id))
}

counts <- lapply(states, count_obs)

Example map function

This is the tidyverse way…

How many UFOs were observed in each state?

states <- unique(ufo$state)

count_obs <- function(state_id){
  nrow(ufo %>% filter(state == state_id))
}

ans_list <- map(states, count_obs)
ans_vect <- map_int(states, count_obs)

More with map

head(mtcars)
##                    mpg cyl disp  hp drat    wt  qsec vs am gear carb
## Mazda RX4         21.0   6  160 110 3.90 2.620 16.46  0  1    4    4
## Mazda RX4 Wag     21.0   6  160 110 3.90 2.875 17.02  0  1    4    4
## Datsun 710        22.8   4  108  93 3.85 2.320 18.61  1  1    4    1
## Hornet 4 Drive    21.4   6  258 110 3.08 3.215 19.44  1  0    3    1
## Hornet Sportabout 18.7   8  360 175 3.15 3.440 17.02  0  0    3    2
## Valiant           18.1   6  225 105 2.76 3.460 20.22  1  0    3    1
models <- mtcars %>% 
  split(.$cyl) %>% 
  map(function(df) lm(mpg ~ wt, data = df))

** Example stolen directly from R for Data Science: http://r4ds.had.co.nz/iteration.html#the-map-functions

What did we get?

summary(models[[1]])
## 
## Call:
## lm(formula = mpg ~ wt, data = df)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -4.1513 -1.9795 -0.6272  1.9299  5.2523 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   39.571      4.347   9.104 7.77e-06 ***
## wt            -5.647      1.850  -3.052   0.0137 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 3.332 on 9 degrees of freedom
## Multiple R-squared:  0.5086, Adjusted R-squared:  0.454 
## F-statistic: 9.316 on 1 and 9 DF,  p-value: 0.01374

Extract coefficients

names(models[[1]])
##  [1] "coefficients"  "residuals"     "effects"       "rank"         
##  [5] "fitted.values" "assign"        "qr"            "df.residual"  
##  [9] "xlevels"       "call"          "terms"         "model"

Extract coefficients

map(models, "coefficients")
## $`4`
## (Intercept)          wt 
##   39.571196   -5.647025 
## 
## $`6`
## (Intercept)          wt 
##   28.408845   -2.780106 
## 
## $`8`
## (Intercept)          wt 
##   23.868029   -2.192438

** Caveat: May not be a good idea statistically…
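To pull out just the wt slope from each model as a numeric vector, map_dbl() with a small anonymous function works. A sketch, rebuilding the models list from above:

```r
library(dplyr)
library(purrr)

# models as built above: one lm() per cylinder group.
models <- mtcars %>%
  split(.$cyl) %>%
  map(function(df) lm(mpg ~ wt, data = df))

# One slope per group; map_dbl() returns a named numeric vector.
slopes <- map_dbl(models, function(m) coef(m)[["wt"]])
```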

Merge data from multiple sources: join()

Are number of UFO sightings per year associated with economic conditions?

join()

First, get economic data: ?economics

kable(head(economics))
date pce pop psavert uempmed unemploy
1967-07-01 507.4 198712 12.5 4.5 2944
1967-08-01 510.5 198911 12.5 4.7 2945
1967-09-01 516.3 199113 11.7 4.6 2958
1967-10-01 512.9 199311 12.5 4.9 3143
1967-11-01 518.1 199498 12.5 4.7 3066
1967-12-01 525.8 199657 12.1 4.8 3018

join()

Make a year column in the UFO dataset. This requires a brief detour into lubridate.

library(lubridate)

ufo <- ufo %>% 
  separate(datetime, into = c("date", "time"), sep = " ") %>%
  mutate(date = mdy(date)) %>% 
  mutate(year = year(date))
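A quick sketch of the two lubridate calls used above: mdy() parses month/day/year text into a Date, and year() extracts the year.

```r
library(lubridate)

d <- mdy("10/10/1949")  # month/day/year text becomes a Date
year(d)                 # 1949
```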

join()

Let’s see what we got out of that:

ufo %>% 
  select(date, time, year) %>% 
  head() %>% 
  kable()
date time year
1949-10-10 20:30 1949
1949-10-10 21:00 1949
1955-10-10 17:00 1955
1956-10-10 21:00 1956
1960-10-10 20:00 1960
1961-10-10 19:00 1961

join()

We’ll need to know how many UFO sightings there were each year in our dataset.

Challenge: how would you do this?

  • Consider:
    • for loops
    • apply functions
    • map functions
    • group_by() %>% summarise()

join()

ufo_counts <- ufo %>% 
  group_by(year) %>%
  count()

kable(head(ufo_counts))
year n
1906 1
1910 2
1916 1
1920 1
1925 1
1929 1

join()

We’ll also want economic summaries by year.

econ_summary <- economics %>% 
  mutate(year = year(date)) %>% 
  group_by(year) %>% 
  summarise_all(mean) %>%
  ungroup() %>%
  select(-date)

At last: join()

df_new <- left_join(ufo_counts, econ_summary, by = "year") %>% 
  filter(year >= min(econ_summary$year)) 
kable(head(df_new))
year n pce pop psavert uempmed unemploy
1967 188 515.1667 199200.3 12.30000 4.700000 3012.333
1968 220 557.4583 200663.8 11.21667 4.500000 2797.417
1969 155 604.4833 202648.7 10.74167 4.441667 2830.167
1970 147 647.6917 204982.3 12.60833 4.983333 4127.333
1971 130 701.0000 207589.3 13.25833 6.275000 5021.667
1972 158 769.4333 209837.6 12.11667 6.108333 4875.833

Why join?

How do joins work?

(image: diagram of join types)
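A minimal sketch with toy tibbles (hypothetical data, not the UFO set) showing how two of the join types differ:

```r
library(dplyr)

a <- tibble(id = c(1, 2, 3), x = c("a", "b", "c"))
b <- tibble(id = c(2, 3, 4), y = c("B", "C", "D"))

left_join(a, b, by = "id")   # all rows of a; unmatched y becomes NA
inner_join(a, b, by = "id")  # only ids present in both tables (2 and 3)
```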

Wide to long: melt

What if we wanted to plot all these economic variables at the same time?

Wide to long: melt

library(reshape2)
df_long <- df_new %>% 
  melt(id.vars = c("year", "n"))

kable(head(df_long))
year n variable value
1967 188 pce 515.1667
1968 220 pce 557.4583
1969 155 pce 604.4833
1970 147 pce 647.6917
1971 130 pce 701.0000
1972 158 pce 769.4333
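The same wide-to-long reshape can stay within the tidyverse: tidyr's gather() is the analogue of reshape2::melt(). A sketch, using a hypothetical two-row stand-in for df_new:

```r
library(dplyr)
library(tidyr)

# Hypothetical stand-in for df_new above (same shape, abbreviated values).
df_new <- tibble(year = c(1967, 1968), n = c(188, 220),
                 pce = c(515.2, 557.5), pop = c(199200, 200664))

# gather(): tidyr's analogue of reshape2::melt() with id.vars = year, n.
df_long <- df_new %>%
  gather(key = "variable", value = "value", -year, -n)
```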

Wide to long: melt

ggplot(df_long, aes(x = value, y = n)) + 
  geom_point() +
  geom_smooth() + 
  facet_wrap(~variable, scales = "free_x")

Text data: stringr functions

Which UFO observations refer to a color?

colors <- c(" red ", " orange ", " yellow ", " green ", " blue ", " purple ", " pink ", " white ", " black ")

ufo_colors <- ufo %>% 
  mutate(color_ans = str_detect(comments, colors)) %>% 
  filter(color_ans == T)
## Warning in stri_detect_regex(string, pattern, opts_regex = opts(pattern)):
## longer object length is not a multiple of shorter object length
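The warning appears because str_detect() recycles the nine color patterns across the comments vector instead of checking every color against every comment. One fix (a sketch) is to collapse the colors into a single regex alternation:

```r
library(stringr)

colors <- c(" red ", " orange ", " yellow ", " green ", " blue ",
            " purple ", " pink ", " white ", " black ")

# One pattern matching any color: " red | orange | ... | black "
color_pattern <- paste(colors, collapse = "|")

str_detect("1 object with green and red lights", color_pattern)  # TRUE
```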

Why the spaces?

Look at some comments

ufo_colors$comments[1:10]
##  [1] "1 object with green and red lights"                                                                                                     
##  [2] "Object silently traveled north -northwest. It was V shaped with five orange lights on the perimeter and one white ligh in the center of"
##  [3] "Observed 2 white clouds of identical shape in clear blue shy with object becoming visible beneath one cloud."                           
##  [4] "Silent&#44 oval&#44 bright white craft in yard."                                                                                        
##  [5] "12 ovel objects flying east to west orange and red at a slow speed then vanished"                                                       
##  [6] "Orange light flies overhead and turns black as it passed silently."                                                                     
##  [7] "Spherical.  Red&#44 yellow&#44 and green lights.  Below clouds&#44 yet very high up."                                                   
##  [8] "Bright white light and then 2 blinking red lights in Western Sky"                                                                       
##  [9] "Big extremely bright white light hovering in sky"                                                                                       
## [10] "Red circular object surrounded by yellow glow flying w/no sound quickly coming out of east turning southbound"

Extract the colors

colors_ans <- map(colors, function(color){str_extract(ufo_colors$comments, color)})

colors_df <- as.data.frame(colors_ans) 
colors_df <- colors_df %>%
  unite("colors_included", 1:ncol(colors_df), sep = ";") %>%
  mutate(colors_included = str_replace_all(colors_included, "NA;", "")) %>%
  mutate(colors_included = str_replace_all(colors_included, ";NA", ""))

head(colors_df)
##    colors_included
## 1     red ; green 
## 2  orange ; white 
## 3    blue ; white 
## 4           white 
## 5    red ; orange 
## 6           black

Add colors to original data

ufo_colors <- ufo_colors %>% mutate(colors = colors_df$colors_included)

Working with text data

“You never get better at regex; you just get better at googling.”

http://stringr.tidyverse.org

https://www.tidytextmining.com/

Put it together to answer a question

What colors are most common in UFOs, and is there a relationship between color and shape?

library(tidytext)
colors_df <- ufo_colors %>% 
  unnest_tokens(output = colors, input = colors, 
                token = stringr::str_split, pattern = ";") %>%
  mutate(colors = str_replace_all(colors, " ", ""))
colors_df %>% 
  select(date, colors) %>%
  head() %>% 
  kable()
date colors
1993-10-10 red
1993-10-10 green
1999-10-10 orange
1999-10-10 white
2007-10-10 blue
2007-10-10 white

Most common UFO colors

colors_df %>% 
  group_by(colors) %>% 
  count(sort = T) %>% 
  ggplot(aes(x = colors, y = n)) + 
  geom_col() + 
  coord_flip()

Looks like white UFOs are the most common, but red and orange show up a lot too!

Is there a relationship between color and shape?

First, we should just get the 10 most common shapes.

Challenge: How would you do that?

Most common shapes

shapes <- colors_df %>% 
  filter(!is.na(shape)) %>%
  group_by(shape) %>%
  count(sort = TRUE) %>%
  ungroup() %>%
  slice(1:10)

kable(shapes)
shape n
light 621
triangle 319
circle 317
fireball 248
sphere 206
unknown 200
other 164
oval 143
disk 116
formation 72

What’s wrong with this?

Keep only data with shapes in the top 10.

Challenge: How would you do that?

Keep only data with shapes in the top 10.

shape_colors <- colors_df %>%
  filter(shape %in% shapes$shape)
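An equivalent tidyverse approach (a sketch with toy, hypothetical tibbles) uses semi_join(), which filters one table down to rows that have a match in another:

```r
library(dplyr)

# Toy stand-ins: keep rows of `obs` whose shape appears in `top`.
obs <- tibble(shape = c("light", "disk", "cigar"))
top <- tibble(shape = c("light", "disk"))

semi_join(obs, top, by = "shape")  # drops the "cigar" row
```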

Relationship between shape and color:

ggplot(shape_colors, aes(x = shape, fill = colors)) + 
  geom_bar(position = "fill")

Review

What do each of these functions do?

  • mutate()

  • filter()

  • select()

  • group_by() %>% summarise()

  • str_detect() or str_extract()

  • the *_join() functions

Review: iteration

What are some approaches to iteration?

  • for loops

  • apply() functions

  • map() functions

  • in simple cases, may be replaceable with group_by() %>% summarise()